# Zero-shot Transfer

**Openvision Vit Base Patch16 160** · Apache-2.0 · UCSC-VLAA · 15 downloads · 0 likes
OpenVision is a fully open-source, cost-effective family of advanced vision encoders for multimodal learning.
Tags: Multimodal Fusion

**Vica2 Stage2 Onevision Ft** · Apache-2.0 · nkkbr · 63 downloads · 0 likes
ViCA2 is a 7B-parameter multimodal vision-language model focused on video understanding and visual-spatial cognition tasks.
Tags: Video-to-Text, Transformers, English

**Blip Custom Captioning** · BSD-3-Clause · hiteshsatwani · 78 downloads · 0 likes
BLIP is a unified vision-language pretraining framework that excels at vision-language tasks such as image caption generation.
Tags: Image-to-Text
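
Several entries in this listing are BLIP captioning checkpoints. As a minimal sketch of how such a model is typically used through the standard Hugging Face BLIP classes, the snippet below assumes the Salesforce/blip-image-captioning-base checkpoint and a placeholder image path; neither comes from this listing.

```python
# Minimal BLIP captioning sketch; checkpoint id and image path are placeholders.
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

ckpt = "Salesforce/blip-image-captioning-base"  # assumed upstream checkpoint
processor = BlipProcessor.from_pretrained(ckpt)
model = BlipForConditionalGeneration.from_pretrained(ckpt)

image = Image.open("photo.jpg").convert("RGB")  # placeholder image

# Unconditional captioning: describe the image from scratch.
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))

# Conditional captioning: a text prefix steers the description.
inputs = processor(images=image, text="a photograph of", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```
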
**Vit So400m Patch16 Siglip 256.v2 Webli** · Apache-2.0 · timm · 12.56k downloads · 0 likes
SigLIP 2 ViT model containing only the image encoder, for image feature extraction; trained on the WebLI dataset.
Tags: Text-to-Image, Transformers

**Vit So400m Patch14 Siglip 224.v2 Webli** · Apache-2.0 · timm · 7,005 downloads · 0 likes
A Vision Transformer based on the SigLIP 2 architecture, designed for image feature extraction and pretrained on the WebLI dataset.
Tags: Image Classification, Transformers

**Vit Large Patch16 Siglip 384.v2 Webli** · Apache-2.0 · timm · 4,265 downloads · 0 likes
A Vision Transformer based on the SigLIP 2 architecture, designed for image feature extraction and pretrained on the WebLI dataset.
Tags: Text-to-Image, Transformers

**Vit Large Patch16 Siglip 256.v2 Webli** · Apache-2.0 · timm · 525 downloads · 0 likes
A Vision Transformer based on the SigLIP 2 architecture, designed for image feature extraction and trained on the WebLI dataset.
Tags: Image Classification, Transformers

**Vit Giantopt Patch16 Siglip 384.v2 Webli** · Apache-2.0 · timm · 160 downloads · 0 likes
A ViT image encoder based on SigLIP 2, packaged for timm and suitable for vision-language tasks.
Tags: Image Classification, Transformers

**Vit Base Patch16 Siglip Gap 256.v2 Webli** · Apache-2.0 · timm · 114 downloads · 1 like
A ViT image encoder based on SigLIP 2 that uses global average pooling with the attention pooling head removed; suitable for image feature extraction.
Tags: Multimodal Fusion, Transformers

**Vit Base Patch16 Siglip 384.v2 Webli** · Apache-2.0 · timm · 330 downloads · 0 likes
A Vision Transformer based on SigLIP 2, designed for image feature extraction and pretrained on the WebLI dataset.
Tags: Text-to-Image, Transformers

**Vit Base Patch16 Siglip 224.v2 Webli** · Apache-2.0 · timm · 1,992 downloads · 0 likes
A ViT model based on SigLIP 2, focused on image feature extraction and trained on the WebLI dataset.
Tags: Text-to-Image, Transformers
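
The SigLIP 2 entries above are timm image encoders intended for feature extraction. A minimal sketch of the usual timm loading pattern follows; the lower-case timm model id and the image path are assumptions for illustration.

```python
# Minimal feature-extraction sketch for a timm SigLIP 2 image encoder.
import timm
import torch
from PIL import Image

model = timm.create_model(
    "vit_base_patch16_siglip_224.v2_webli",  # assumed timm id for one entry above
    pretrained=True,
    num_classes=0,  # drop the head and return pooled image features
)
model.eval()

# Preprocessing that matches the pretrained weights.
cfg = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**cfg, is_training=False)

image = Image.open("example.jpg").convert("RGB")  # placeholder path
with torch.no_grad():
    features = model(transform(image).unsqueeze(0))  # shape: (1, embed_dim)
print(features.shape)
```
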
**Blip Image Captioning Large** · BSD-3-Clause · drgary · 23 downloads · 1 like
A vision-language model pretrained on the COCO dataset, excelling at generating accurate image descriptions.
Tags: Image-to-Text

**Convnext Large Mlp.clip Laion2b Ft Soup 320** · Apache-2.0 · timm · 173 downloads · 0 likes
A ConvNeXt-Large image encoder based on the CLIP architecture, fine-tuned on the LAION-2B dataset; supports image feature extraction at 320x320 resolution.
Tags: Image Classification, Transformers

**Convnext Large Mlp.clip Laion2b Augreg** · Apache-2.0 · timm · 107 downloads · 0 likes
A ConvNeXt-Large image encoder based on the CLIP framework, trained on the LAION-2B dataset; supports visual feature extraction.
Tags: Image Classification, Transformers
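
For the ConvNeXt CLIP encoders above, timm can also return per-stage spatial feature maps instead of a single pooled vector. The sketch below assumes the lower-case timm id and a placeholder image.

```python
# Minimal sketch: per-stage feature maps from a timm ConvNeXt CLIP encoder.
import timm
import torch
from PIL import Image

model = timm.create_model(
    "convnext_large_mlp.clip_laion2b_augreg",  # assumed timm id for the entry above
    pretrained=True,
    features_only=True,  # return intermediate feature maps instead of a pooled vector
)
model.eval()

cfg = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**cfg, is_training=False)

image = Image.open("example.jpg").convert("RGB")  # placeholder path
with torch.no_grad():
    feature_maps = model(transform(image).unsqueeze(0))
for fmap in feature_maps:
    print(fmap.shape)  # one (1, C, H, W) tensor per stage
```
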
**Cogact Small** · MIT · CogACT · 405 downloads · 4 likes
CogACT is a novel Vision-Language-Action (VLA) architecture derived from vision-language models (VLMs), designed specifically for robot manipulation.
Tags: Multimodal Fusion, Transformers, English

**Cogact Large** · MIT · CogACT · 122 downloads · 3 likes
CogACT is a novel Vision-Language-Action (VLA) architecture derived from vision-language models (VLMs), designed specifically for robot manipulation.
Tags: Multimodal Fusion, Transformers, English

**Cogact Base** · MIT · CogACT · 6,589 downloads · 12 likes
CogACT is a novel Vision-Language-Action (VLA) architecture that combines vision-language models with specialized action modules for robotic manipulation tasks.
Tags: Multimodal Fusion, Transformers, English

**Aimv2 Large Patch14 Native Image Classification** · MIT · amaye15 · 15 downloads · 2 likes
AIMv2-Large-Patch14-Native adapted for image classification, modified from the original AIMv2 model to be compatible with Hugging Face Transformers' AutoModelForImageClassification class.
Tags: Image Classification, Transformers
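
The AIMv2 entry above is described as compatible with AutoModelForImageClassification. A generic sketch of that workflow follows; the repo id is only a guess based on the listing's author and name, and the trust_remote_code flag is an assumption.

```python
# Generic image-classification sketch; repo id and image path are assumptions.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

repo = "amaye15/aimv2-large-patch14-native-image-classification"  # assumed id
processor = AutoImageProcessor.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForImageClassification.from_pretrained(repo, trust_remote_code=True)

image = Image.open("cat.jpg").convert("RGB")  # placeholder image
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```
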
**Sam2.1 Hiera Small** · Apache-2.0 · facebook · 7,333 downloads · 6 likes
SAM 2 is a foundation model for promptable visual segmentation in images and videos, developed by FAIR; it supports efficient segmentation through prompts.
Tags: Image Segmentation

**Sam2.1 Hiera Large** · Apache-2.0 · facebook · 203.27k downloads · 81 likes
SAM 2 is a foundation model for promptable visual segmentation in images and videos, developed by FAIR; it supports general-purpose segmentation through prompts.
Tags: Image Segmentation

**Sam2 Hiera Base Plus** · Apache-2.0 · facebook · 18.17k downloads · 6 likes
SAM 2 is a foundation model for promptable visual segmentation in images and videos, developed by FAIR; it supports efficient segmentation through prompts.
Tags: Image Segmentation
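
The SAM 2 entries above are prompted at inference time with points or boxes. The sketch below uses the image predictor API from the sam2 package (facebookresearch/sam2 repository); the checkpoint id, image path, and point prompt are placeholder assumptions.

```python
# Minimal promptable-segmentation sketch with the sam2 package.
import numpy as np
import torch
from PIL import Image
from sam2.sam2_image_predictor import SAM2ImagePredictor

predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2.1-hiera-large")  # assumed id

image = np.array(Image.open("scene.jpg").convert("RGB"))  # placeholder image
with torch.inference_mode():
    predictor.set_image(image)
    # One foreground click at pixel (x=500, y=375).
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[500, 375]]),
        point_labels=np.array([1]),
    )
print(masks.shape, scores)
```
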
**Cogflorence 2.1 Large** · MIT · thwri · 2,541 downloads · 22 likes
A fine-tuned version of microsoft/Florence-2-large, trained on a 40,000-image subset of the Ejafa/ye-pop dataset with annotations generated by THUDM/cogvlm2-llama3-chat-19B; focused on image-to-text tasks.
Tags: Image-to-Text, Transformers, Supports Multiple Languages
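
Florence-2 derivatives like the entry above are usually driven with task-prompt tokens. This sketch follows the commonly documented Florence-2 usage pattern; the thwri/CogFlorence-2.1-Large repo id and the image path are assumptions.

```python
# Minimal Florence-2-style captioning sketch; repo id and image are placeholders.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

repo = "thwri/CogFlorence-2.1-Large"  # assumed id for the entry above
processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True)

image = Image.open("street.jpg").convert("RGB")  # placeholder image
task = "<MORE_DETAILED_CAPTION>"  # Florence-2 task-prompt token
inputs = processor(text=task, images=image, return_tensors="pt")

with torch.no_grad():
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=256,
        num_beams=3,
    )
raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
parsed = processor.post_process_generation(raw, task=task, image_size=(image.width, image.height))
print(parsed[task])
```
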
**RADIO L** · nvidia · 23.27k downloads · 8 likes
AM-RADIO is a visual foundation model developed by NVIDIA Research; its aggregated architecture provides a unified multi-domain representation suitable for a variety of computer vision tasks.
Tags: Image Segmentation, Transformers

**RADIO B** · nvidia · 999 downloads · 3 likes
RADIO is a vision foundation model developed by NVIDIA Research, capable of unifying visual information across domains for a variety of vision tasks.
Tags: Image Segmentation, Transformers

**Cogflorence 2 Large Freeze** · MIT · thwri · 419 downloads · 14 likes
A fine-tuned version of microsoft/Florence-2-large, trained on a 38,000-image subset of the Ejafa/ye-pop dataset with CogVLM2-generated annotations; focused on image-to-text tasks.
Tags: Image-to-Text, Transformers, Supports Multiple Languages

**Emotion LLaMA** · Apache-2.0 · ZebangCheng · 213 downloads · 4 likes
Released under the Apache-2.0 license; no further details are provided in the listing.
Tags: Large Language Model, Transformers

**Fashion Embedder** · MIT · McClain · 58 downloads · 0 likes
FashionCLIP is a CLIP-based vision-language model fine-tuned for the fashion domain, capable of generating general-purpose representations of fashion products.
Tags: Text-to-Image, Transformers, English
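
FashionCLIP-style encoders such as the Fashion Embedder entry are typically queried through the standard CLIP classes. In the sketch below, the patrickjohncyh/fashion-clip checkpoint (the commonly referenced upstream FashionCLIP) and the inputs are illustrative assumptions, not taken from this listing.

```python
# Minimal CLIP-style text/image matching sketch for a fashion encoder.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

ckpt = "patrickjohncyh/fashion-clip"  # assumed upstream FashionCLIP checkpoint
model = CLIPModel.from_pretrained(ckpt)
processor = CLIPProcessor.from_pretrained(ckpt)

image = Image.open("dress.jpg").convert("RGB")  # placeholder product photo
texts = ["a red evening dress", "a pair of running shoes"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image embedding for retrieval, plus image-text similarity scores.
print(outputs.image_embeds.shape)
print(outputs.logits_per_image.softmax(dim=-1))
```
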
**Chronos T5 Small** · Apache-2.0 · autogluon · 54.04k downloads · 5 likes
Chronos-T5 is a pretrained time series forecasting model built on a language-model architecture. It converts time series into token sequences through scaling and quantization for training, making it suitable for a wide range of forecasting tasks.
Tags: Climate Model, Transformers
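
Chronos-T5 checkpoints are normally used through the chronos-forecasting package rather than raw transformers. The sketch below assumes that package; the repo id follows the listing's author and name, and the series values are made-up toy data.

```python
# Minimal probabilistic-forecasting sketch with the chronos-forecasting package.
import torch
from chronos import ChronosPipeline

pipeline = ChronosPipeline.from_pretrained("autogluon/chronos-t5-small")  # assumed id

# Toy univariate history; a real use case would pass an actual time series.
context = torch.tensor([112.0, 118.0, 132.0, 129.0, 121.0, 135.0, 148.0, 148.0])
forecast = pipeline.predict(context, prediction_length=12)  # (1, num_samples, 12)

# Point forecast as the median over the sampled trajectories.
median = forecast[0].quantile(0.5, dim=0)
print(median)
```
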
**Zoedepth Kitti** · MIT · Intel · 7,037 downloads · 2 likes
ZoeDepth is a vision model for monocular depth estimation, fine-tuned on the KITTI dataset and capable of zero-shot transfer for metric depth estimation.
Tags: 3D Vision, Transformers
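
ZoeDepth is integrated into transformers, so the entry above can be exercised through the depth-estimation pipeline. The Intel/zoedepth-kitti repo id and the image path below are assumptions based on the listing.

```python
# Minimal monocular depth-estimation sketch via the transformers pipeline.
from PIL import Image
from transformers import pipeline

depth_estimator = pipeline("depth-estimation", model="Intel/zoedepth-kitti")  # assumed id

image = Image.open("road.jpg").convert("RGB")  # placeholder driving image
result = depth_estimator(image)

depth_image = result["depth"]             # PIL image, convenient for visualization
depth_tensor = result["predicted_depth"]  # raw predicted depth values
print(depth_tensor.shape)
```
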
**Web Register Classification Multilingual** · Apache-2.0 · TurkuNLP · 106 downloads · 3 likes
A multilingual web register classifier based on fine-tuned XLM-RoBERTa-large, supporting text classification in 100 languages.
Tags: Text Classification, Transformers, Supports Multiple Languages
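
The TurkuNLP entry above is a fine-tuned XLM-RoBERTa sequence classifier. The sketch below guesses the repo id from the listing and assumes a multi-label sigmoid readout, which is common for register classification but should be checked against the model card.

```python
# Generic multi-label text-classification sketch; repo id and threshold are assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

repo = "TurkuNLP/web-register-classification-multilingual"  # assumed id
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo)

text = "Mix two cups of flour with one cup of water and knead until smooth."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.sigmoid(logits)[0]  # assumed multi-label readout
for label_id, p in enumerate(probs):
    if p > 0.5:
        print(model.config.id2label[label_id], round(float(p), 3))
```
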
**Nllb Uzbek Russian** · Apache-2.0 · sarahai · 54 downloads · 1 like
An open-source model under the Apache-2.0 license; the listing gives no further details on its functionality.
Tags: Large Language Model, Transformers

**Bert Large Maths** · Apache-2.0 · reyvan · 330 downloads · 1 like
An open-source model under the Apache-2.0 license; no further details are provided.
Tags: Large Language Model, Transformers

**Image Captioning With Blip** · BSD-3-Clause · Vidensogende · 16 downloads · 0 likes
BLIP is a unified vision-language pretraining framework that excels at tasks such as image caption generation and supports both conditional and unconditional text generation.
Tags: Image-to-Text, Transformers

**Swissmedical Faqs Classification V1** · Apache-2.0 · FedericoDamboreana · 23 downloads · 1 like
An open-source model under the Apache-2.0 license; specific functionality depends on the underlying model type.
Tags: Large Language Model, Transformers

**Valencearousalvam** · MoroQ007 · 17 downloads · 0 likes
No model information available.
Tags: Large Language Model, Transformers

**Image Caption Large Copy** · BSD-3-Clause · Sof22 · 1,042 downloads · 10 likes
BLIP is an advanced vision-language pretraining model that excels at image captioning, making effective use of web data through its caption bootstrapping strategy.
Tags: Image-to-Text, Transformers

**Blip** · BSD-3-Clause · upro · 19 downloads · 2 likes
BLIP is an advanced vision-language pretrained model that excels at image captioning, generating accurate natural-language descriptions of image content.
Tags: Image-to-Text, Transformers

**Table Detection Detr** · Other · Christian710 · 13 downloads · 2 likes
No description is provided for this model.
Tags: Large Language Model, Transformers

**Blip Image Captioning Large** · BSD-3-Clause · movementso · 18 downloads · 0 likes
BLIP is a unified vision-language pretraining framework that excels at image caption generation and understanding, making efficient use of web data through caption bootstrapping.
Tags: Image-to-Text, Transformers

**Llava 7B Lightening V1 1** · mmaaz60 · 1,736 downloads · 10 likes
LLaVA-Lightning-7B is a multimodal model based on LLaMA-7B that achieves efficient vision-language processing through delta-parameter tuning.
Tags: Large Language Model, Transformers